Chapter 4: Clustering and classification

RStudio exercise 4: Data analysis and wrangling

Data analysis

2. Data

The Boston dataset contains data from a study on housing values in suburbs of Boston. With the data you can for example study the effect of crime rates or air quality on the value of owner-ocupied homes. You can learn more about the data here.

The dataset contains 506 observations on 14 variables. The observations are numeric with most variables taking natural number values. There are also some with integer values and one dichotomous variable.

# load the data
data("Boston")

# glimpse at the dataset
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

3. Overview of the data

Next we will look at the data in a bit more detail. Start by printing out the column/variable names and writing down the definitions of the different variables (from here):

colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
Variable Definition
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitrogen oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted mean of distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat lower status of the population (percent)
medv median value of owner-occupied homes in $1000s

From the summary of the variables we can see minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of the variables.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can study and visualize the correlations between the different variables with the help of a correlations matrix and a correlations plot.

# First calculate the correlation matrix and round it so that it includes only two digits:
cor_matrix<-cor(Boston) %>% round(digits = 2)

# Print the correlation matrix:
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

From the plot above we can easily see which variables correlate with which and is that correlation positive or negative. So for example, we can see that the distance from the employment centres (dis) correlates negatively with age of the houses (age), nitrogen oxides concentration (nox) and proportion of non-retail business acres (indus). We can also see that accessibility to radial highways (rad) is highly positively correlated with property taxes (tax), as are industrial land use (indus) and air pollution (nox). On the first glimpse, the correlation appear quite intuitive and nothing stands out as very suprising. One could note that the correlation of the only dummy variable chas is almost negligible with all other variables. This could be because the overall share of observations where chas = 1 is very small.

4. Standardizing and scaling the dataset

We will later use linear discriminant analysis on our data, so we need to have varibles that are normally distributed and share the same covariance matrix across different classes. This requires the data to be scaled before fitting a model.

Thus we want to standardize the data set by scaling it. This is very easy to do in R by using the scale() function, which subtracts the mean of a variable from each value of the variable and divides this by the standard error. From the summary we can see that the standardized values are much more similar in magnitude across the different variables, so their relationships are easier to grasp. Note that the mean of each variable is now zero, as it should be. Also note that previously all variables were positive, but now at least the min values are negative for all variables. This is because the mean of all variables is larger than the smallest value of a variable.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Note that the scaled data set is now in matrix format. We want to change the class of the scaled data set into a dataframe, which we can do by converting it with a function as.data.frame():

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Next we want to create a categorical (or factor) variable of the crime rate in the scaled Boston data set. We will use the quantiles as the break points in the categorical variable. We will print the summary of the scaled crime rate, the quantile vector of this variable and a table of the new factor variable to check that everything goes as intended.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

In order to avoid confusion, we want to drop the old crime rate variable from the dataset and replace it with the new categorical variable for crime rates. This is easily done with the following two lines of code:

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Lastly for this step, we will divide the dataset to train and test sets. This will allow as further down the line to check how well our model works. We will assign 80 % of the data to the train set and the remaining 20 % to the test set. The training of the model is done with the train set and predictions on new data is conducted on the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

5. Linear discriminant analysis

Now we will fit the linear discriminant analysis on the train set. The LDA is a classification method that finds the (linear) combination of the variables that separate the target variable classes. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2400990 0.2574257 0.2500000 
## 
## Group means:
##                   zn      indus          chas        nox          rm
## low       0.95313734 -0.8920844 -0.0793339581 -0.8664560  0.43807213
## med_low  -0.06509931 -0.2951030  0.0117915679 -0.5979929 -0.11185772
## med_high -0.40395960  0.2353055  0.1819517287  0.3866572 -0.06085731
## high     -0.48724019  1.0149946  0.0005392655  1.0522191 -0.42284726
##                 age        dis        rad        tax     ptratio
## low      -0.9062951  0.8977263 -0.6868728 -0.7373246 -0.41147812
## med_low  -0.3844676  0.3848914 -0.5615559 -0.5305818 -0.05707988
## med_high  0.4165048 -0.3824748 -0.4164721 -0.2947116 -0.25216256
## high      0.8359851 -0.8645371  1.6596029  1.5294129  0.80577843
##               black      lstat         medv
## low       0.3792107 -0.7693183  0.534112876
## med_low   0.3524186 -0.1727156  0.007978516
## med_high  0.1023849  0.0866083  0.068152872
## high     -0.7900314  0.8681775 -0.662513146
## 
## Coefficients of linear discriminants:
##                  LD1          LD2        LD3
## zn       0.156321053  0.676225751 -0.8905832
## indus   -0.019547429 -0.200159101  0.4065107
## chas    -0.001110237 -0.045017648  0.1011732
## nox      0.274713478 -0.641845168 -1.4216678
## rm       0.027025317  0.002053063 -0.1407244
## age      0.333763752 -0.320024481 -0.1078748
## dis     -0.191770489 -0.179940972  0.1811128
## rad      3.400254119  1.058560042  0.1862033
## tax      0.033258236 -0.190826857  0.3077754
## ptratio  0.169242647  0.080546275 -0.2537049
## black   -0.171057921  0.018548903  0.1741080
## lstat    0.106331768 -0.213377185  0.2439645
## medv     0.030763748 -0.309750086 -0.3225459
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9577 0.0325 0.0099

The LDA calculates the probabilities of a new observation belonging to each of the classes by using the trained model. After that the model classifies each observation to the most probable class.

The most convenient and informative way to consider the results of an LDA is to visualize with a biplot. This we can do following the Datacamp exercise with the following code:

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The arrows in this plot represent the relationships of the original variables and the LDA solution.

6. LDA predicting performance

In this part we want to see how the LDA model performs when predicting on new data. For this end, we can use the predict() function. we start by predicting the crime classes with the test data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       9        0    0
##   med_low    2      17       10    0
##   med_high   0       4       17    1
##   high       0       0        1   25

From the cross tabulation of the results we can see that the model appears to predict the correct results reasonably well. The model has most problems in separating med_low from low, but most of the predictions seems to fall in the correct class.

7. Clustering

Next we will take few steps towards clustering the data. Start by reloading the Boston dataset and standardize the dataset again. We need to do this again, because we have made also other changes to dataset.

# load the data
data("Boston")

# center and standardize variables
boston_scaled2 <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled2)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled)

Next we wish to calculate the distances between the observations. We do this by forming a standard Euclidian distance matrix:

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
## Warning in dist(boston_scaled2): NAs introduced by coercion
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5267  4.9081  4.9394  6.2421 13.0045

I ran into trouble here because there are NA values in the boston_scaled2 distance matrix. The k-means algorithm wouldn’t run on this dataset, retuning an error on the NA values. Despite googling I was not able to find a solution, so I’m doing the rest of this exercise with the original Boston dataset (as in the Datacamp exercises). So, a novel try:

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047

From the summary we can see the relevant moments of the calculated distances between the observations.

Then let’s move on to K-means clustering, that assigns observations to groups or clusters based on similarity of the objects. The following code runs k-means algorithm on the dataset:

# k-means clustering
km <-kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

It’s difficult to see anything from this, so let’s zoom in a bit and look at only the last five columns:

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

Next we want to know what the optimal number of clusters is. We can do this by looking at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. When you plot the number of clusters and the total WCSS, the optimal number of clusters is when the total WCSS drops radically.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

From the plot above it’s evident that optimal number of clusters is not three, but two. Let’s run the k-means algorithm again with two clusters and then visualize the results:

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

Super-bonus

Start by running the code below for the (scaled) train data that we used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
## Warning: package 'plotly' was built under R version 3.4.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Then we will use the package plotly to create a 3D plot of the columns of the matrix product:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: package 'bindrcpp' was built under R version 3.4.2

This is very cool indeed! But addingdifferent colors for the different crime rate classes definetly makes it even cooler still:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Lastly, we are supposed to define the color by the clusters of the k-means. Here I must be doing something wrong, because the colors disappear completely.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km)

Data wrangling

You can find my preprocessed data set and the data wrangling R script from my GitHub repository.